The goal of this project is to better understand customer behavior in a mall through membership cards. The data includes basic information about each customer: ID, age, gender, annual income, and spending score, where the spending score is assigned based on the customer's purchasing data. We use an unsupervised machine learning technique, the k-means clustering algorithm, to group the customers. By better understanding its customers, the mall can tailor its marketing approach to boost revenue.
## Import Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import plotly.express as px
## Import Dataset
df_customers = pd.read_csv("Mall_Customers.csv")
df_customers
| | CustomerID | Gender | Age | Annual Income (k$) | Spending Score (1-100) |
|---|---|---|---|---|---|
| 0 | 1 | Male | 19 | 15 | 39 |
| 1 | 2 | Male | 21 | 15 | 81 |
| 2 | 3 | Female | 20 | 16 | 6 |
| 3 | 4 | Female | 23 | 16 | 77 |
| 4 | 5 | Female | 31 | 17 | 40 |
| ... | ... | ... | ... | ... | ... |
| 195 | 196 | Female | 35 | 120 | 79 |
| 196 | 197 | Female | 45 | 126 | 28 |
| 197 | 198 | Male | 32 | 126 | 74 |
| 198 | 199 | Male | 32 | 137 | 18 |
| 199 | 200 | Male | 30 | 137 | 83 |
200 rows × 5 columns
df_customers.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   CustomerID              200 non-null    int64
 1   Gender                  200 non-null    object
 2   Age                     200 non-null    int64
 3   Annual Income (k$)      200 non-null    int64
 4   Spending Score (1-100)  200 non-null    int64
dtypes: int64(4), object(1)
memory usage: 7.9+ KB
## Change Gender Variable to Numeric
df_customers["Gender"] = df_customers["Gender"].replace(["Male", "Female"], [0, 1])  # 0: male, 1: female
df_customers.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   CustomerID              200 non-null    int64
 1   Gender                  200 non-null    int64
 2   Age                     200 non-null    int64
 3   Annual Income (k$)      200 non-null    int64
 4   Spending Score (1-100)  200 non-null    int64
dtypes: int64(5)
memory usage: 7.9 KB
There are five variables (CustomerID, Gender, Age, Annual Income, and Spending Score) in our dataset of 200 observed customers. Each variable provides basic information about the customers.
df_customers.describe()
| | CustomerID | Gender | Age | Annual Income (k$) | Spending Score (1-100) |
|---|---|---|---|---|---|
| count | 200.000000 | 200.000000 | 200.000000 | 200.000000 | 200.000000 |
| mean | 100.500000 | 0.560000 | 38.850000 | 60.560000 | 50.200000 |
| std | 57.879185 | 0.497633 | 13.969007 | 26.264721 | 25.823522 |
| min | 1.000000 | 0.000000 | 18.000000 | 15.000000 | 1.000000 |
| 25% | 50.750000 | 0.000000 | 28.750000 | 41.500000 | 34.750000 |
| 50% | 100.500000 | 1.000000 | 36.000000 | 61.500000 | 50.000000 |
| 75% | 150.250000 | 1.000000 | 49.000000 | 78.000000 | 73.000000 |
| max | 200.000000 | 1.000000 | 70.000000 | 137.000000 | 99.000000 |
## Correlation of variables
plt.figure()
s = sns.heatmap(df_customers.corr(),
annot = True,
cmap = "rocket")
s.set_xticklabels(s.get_xticklabels(), rotation=90)
plt.show()
From the correlation plot above, we can see that the correlations between most pairs of variables are negligible. There is a weak negative correlation between age and spending score, and a very strong positive correlation between CustomerID and annual income. Let's examine these two relationships with scatter plots.
## Relationship between age and spending score.
plt.scatter(df_customers["Age"], df_customers["Spending Score (1-100)"])
plt.xlabel("Age")
plt.ylabel("Spending Score (1-100)")
From the plot above, we can see that customers older than 40 tend to have lower spending scores than those younger than 40. This downward trend of spending with age is what produces the negative correlation coefficient between age and spending score.
## Relationship between customerID and annual income.
plt.scatter(df_customers["CustomerID"], df_customers["Annual Income (k$)"])
plt.xlabel("CustomerID")
plt.ylabel("Annual Income (k$)")
The plot above shows that CustomerID increases with annual income, which explains the very strong correlation between the two variables. However, there is no causal relationship here: the rows simply appear to be ordered by income, and CustomerID is just a unique identifier for each customer. This is a spurious correlation.
For the rest of our analysis we use only three variables: age, annual income, and spending score.
X = df_customers[["Age", "Annual Income (k$)", "Spending Score (1-100)" ]].values
Machine learning algorithms such as gradient-based methods (linear regression, logistic regression, principal component analysis, neural networks, etc.) and distance-based methods (KNN, k-means clustering, and support vector machines) require data to be scaled. Scaling converts the values of each variable so that they are all on a similar numerical scale. Since we want to preserve the relationships between data points, we use standardized data for the analysis. Here's the formula for standardization: $$X' = \frac{X-\mu}{\sigma}$$
## Standardize Dataset
scaler = StandardScaler()
X_std = scaler.fit_transform(X) # scaled data
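As a quick sanity check, `StandardScaler` applies exactly the formula above, using the population standard deviation (`ddof=0`). A minimal sketch on a made-up feature matrix (the values below are invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for a few [Age, Income, Score] rows.
X = np.array([[19.0, 15.0, 39.0],
              [21.0, 15.0, 81.0],
              [20.0, 16.0, 6.0],
              [23.0, 16.0, 77.0]])

scaler = StandardScaler()
X_std = scaler.fit_transform(X)

# Manual standardization: (X - mu) / sigma, with the population std
# (np.std uses ddof=0 by default, matching StandardScaler).
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_std, X_manual))  # True
```

Standardized columns have mean 0 and standard deviation 1, so no single feature dominates the distance computations in k-means.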
K-means is a simple unsupervised learning algorithm for clustering. With k-means clustering we can detect patterns in our dataset and group similar data points together.
Our goal is to assign each data point to a cluster. Nearby points should be assigned to the same cluster. Formally, the task of clustering is to take a set of points as input and return a partitioning of the points into K clusters.
Algorithm
Step 1: Pick $K$ random points as cluster centers (or means).
Step 2: Assign data points to closest cluster center. Then change the cluster center to the average of its assigned points.
Step 3: Repeat step 2 until convergence.
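The three steps above can be sketched in plain NumPy. This is a minimal illustration of the algorithm, not the scikit-learn implementation we use below; `kmeans_sketch` and the toy blobs are made up for this example:

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=42):
    """Minimal k-means: random init, assign, update, repeat until convergence."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K random points as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2a: assign each point to the closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2b: move each center to the mean of its assigned points
        # (keep the old center if a cluster ends up empty).
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # Step 3: stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two well-separated blobs; the sketch should recover them as two clusters.
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0, 0.3, (20, 2)),
                   rng.normal(5, 0.3, (20, 2))])
labels, centers = kmeans_sketch(X_toy, k=2)
```

Scikit-learn's `KMeans` adds a smarter initialization (`k-means++`) and multiple restarts (`n_init`), which is why we use it for the actual analysis.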
With this algorithm, you must decide beforehand how many clusters you expect to create. We pick the right K using the elbow method. Another metric is the silhouette score, a measure of how similar a data point is to its own cluster (cohesion) compared to other clusters (separation). Pick the number of clusters K whose silhouette score is closest to 1.
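A quick toy example of how to read the silhouette score (the blob data here is invented for illustration): tight, well-separated clusters score close to 1, while overlapping clusters push the score toward 0.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two tight, well-separated blobs: silhouette should be close to 1.
rng = np.random.default_rng(1)
X_toy = np.vstack([rng.normal(0, 0.1, (25, 2)),
                   rng.normal(10, 0.1, (25, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_toy)
score = silhouette_score(X_toy, labels)
print(round(score, 2))  # close to 1 for well-separated clusters
```

On real, overlapping data such as ours, scores in the 0.3-0.5 range are common; we compare scores across values of K rather than expecting values near 1.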
## Use the elbow method to find the optimal number of clusters
import numpy as np
results = [] # Within cluster sum of squares (WCSS)
s_avg = [] # Silhouette score
for i in range(2, 11):
    k_means = KMeans(n_clusters=i, init="k-means++", n_init="auto", random_state=42)
    k_means.fit(X_std)
    results.append(k_means.inertia_)
    silhouette_avg = silhouette_score(X_std, k_means.labels_)
    s_avg.append(silhouette_avg)
    print(f'For n_clusters = {i}, the average silhouette score is : {round(silhouette_avg, 2)}')
plt.plot(range(2, 11), results, marker = 'o', linestyle = '--')
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
# Add a vertical line indicating the number of clusters
n_clusters = 5
plt.axvline(x=n_clusters, linestyle='--', color='red', label=f'Number of Clusters: {n_clusters}')
plt.annotate(f'Optimal Number of Clusters: {n_clusters}', xy=(n_clusters, results[n_clusters - 2]), fontsize=9)
plt.legend()
plt.show()
For n_clusters = 2, the average silhouette score is : 0.34
For n_clusters = 3, the average silhouette score is : 0.35
For n_clusters = 4, the average silhouette score is : 0.34
For n_clusters = 5, the average silhouette score is : 0.42
For n_clusters = 6, the average silhouette score is : 0.38
For n_clusters = 7, the average silhouette score is : 0.37
For n_clusters = 8, the average silhouette score is : 0.41
For n_clusters = 9, the average silhouette score is : 0.39
For n_clusters = 10, the average silhouette score is : 0.35
Both the elbow method and the silhouette score point to 5 as the optimal number of clusters.
k_means = KMeans(n_clusters=5, init='k-means++', n_init = "auto", max_iter=100, tol=0.01, random_state=0)
k_means.fit(X_std)
KMeans(max_iter=100, n_clusters=5, n_init='auto', random_state=0, tol=0.01)
## We create a new data frame with the original features and add a new column with the assigned clusters for each point.
X_kmeans = df_customers[["Age", "Annual Income (k$)", "Spending Score (1-100)" ]].copy()
X_kmeans["Clusters"] = k_means.labels_
centroids = k_means.cluster_centers_
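Note that because the model was fit on standardized data, `cluster_centers_` are in z-score units; `StandardScaler.inverse_transform` maps them back to the original units for interpretation. A self-contained sketch on made-up stand-in data (the ranges roughly mimic our Age, Income, and Spending Score columns):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical stand-in for [Age, Income, Score] rows.
rng = np.random.default_rng(0)
X = np.column_stack([rng.integers(18, 70, 50),
                     rng.integers(15, 137, 50),
                     rng.integers(1, 100, 50)]).astype(float)

scaler = StandardScaler()
X_std = scaler.fit_transform(X)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_std)

# cluster_centers_ live in standardized (z-score) units ...
centroids_std = km.cluster_centers_
# ... so map them back to the original units before interpreting them.
centroids_orig = scaler.inverse_transform(centroids_std)
print(centroids_orig.round(1))
```

In our analysis we avoid this step entirely by computing the per-cluster means directly from the unscaled columns with `groupby`, which gives equivalent, readily interpretable numbers.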
## Calculate mean values for the clusters
X_analysis = X_kmeans.groupby(["Clusters"]).mean()
X_analysis
| Clusters | Age | Annual Income (k$) | Spending Score (1-100) |
|---|---|---|---|
| 0 | 39.871795 | 86.102564 | 19.358974 |
| 1 | 55.638298 | 54.382979 | 48.851064 |
| 2 | 25.185185 | 41.092593 | 62.240741 |
| 3 | 46.250000 | 26.750000 | 18.350000 |
| 4 | 32.875000 | 86.100000 | 81.525000 |
Analyzing the cluster means shows what each cluster represents.
# Compute the size and proportions of the five clusters
X_analysis["No. of Obs"] = X_kmeans.groupby(['Clusters']).size()
X_analysis['Proportion of Obs'] = X_analysis['No. of Obs'] / X_analysis['No. of Obs'].sum()
X_analysis = X_analysis.rename({0: "Middle-Aged High Income LS",
                                1: "Senior Moderate Income",
                                2: "Young Moderate Income",
                                3: "Middle-Aged Low Income",
                                4: "Middle-Aged High Income HS"})  # LS: low spending, HS: high spending
## Add labels to the different segments.
X_kmeans["Labels"] = X_kmeans["Clusters"].map({
0: "Middle-Aged High Income LS",
1: "Senior Moderate Income",
2: "Young Moderate Income",
3: "Middle-Aged Low Income",
4: "Middle-Aged High Income HS" })
## 3D scatter plot to show the 5 different clusters.
fig = px.scatter_3d(X_kmeans, x = "Age", y = "Annual Income (k$)", z = "Spending Score (1-100)", color="Labels",
symbol="Labels", opacity=0.7, size_max=15, title = "Mall Customer Segmentation")
fig.show()
## We can see the number of customers per cluster and their average spending scores.
fig = px.bar(X_analysis, x = X_analysis.index, y = X_analysis["No. of Obs"],
color=X_analysis["Spending Score (1-100)"],
labels={"No. of Obs":"Customers"}, height=400)
fig.show()
fig = px.scatter(X_kmeans,
x = "Age",
y = "Annual Income (k$)",
color = X_kmeans["Labels"],
size = X_kmeans["Annual Income (k$)"]
)
fig.show()
Dataset Source: Mall Customer Segmentation Data